Stabilize diskless no-drop replication test by sarthakaggarwal97 · Pull Request #3511 · valkey-io/valkey

sarthakaggarwal97 · 2026-04-15T04:16:16Z

This deflakes all variants of the `diskless replicas drop during rdb pipe` test.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

codecov · 2026-04-15T05:57:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (6444717) to head (fe1d7e6).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3511      +/-   ##
============================================
- Coverage     76.58%   76.38%   -0.20%     
============================================
  Files           159      159              
  Lines         80019    80019              
============================================
- Hits          61283    61125     -158     
- Misses        18736    18894     +158

see 28 files with indirect coverage changes

Nikhil-Manglore · 2026-04-15T06:04:56Z

Did this test fail recently? I know Viktor attempted to fix it in #3461.

Edit: I just saw the comments on the PR, looks like it did fail again recently.

zuiderkwast

Looks pretty good. Only the "no replicas drop" case is covered, so can the other cases, such as "fast" and "slow", still have timing issues? I see there is some special handling for the other cases; for example, "timeout" uses pause_process as well. Do you have a full picture?

sarthakaggarwal97 · 2026-04-16T16:50:04Z

@zuiderkwast 100 runs for all the variants passed: https://github.com/sarthakaggarwal97/valkey/actions/runs/24521848934

sarthakaggarwal97 · 2026-04-17T00:40:37Z

@zuiderkwast I think it still deflakes the tests a lot, but I'm afraid I still see 1-5 flaky runs out of 500.

sarthakaggarwal97 · 2026-04-17T04:39:12Z

This version is quite stable. Not seeing failures anymore.

zuiderkwast

Very good! Sorry for the delay.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

The 'slow' subcase fallback was searching for a log message matching '*Connection with replica client id * lost.*', but the actual server log format is 'Connection with replica <host>:<port> lost.', with no 'client id' in the message. Fix the glob to '*Connection with replica * lost.*'.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
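The glob mismatch described in this commit message can be reproduced outside the test suite. The sketch below uses Python's `fnmatch.fnmatchcase` as a stand-in for the glob match the test performs. The sample log line, including its host and port, is a hypothetical illustration of the quoted format, not captured server output, and the `matches` helper is not part of the Valkey test framework.

```python
from fnmatch import fnmatchcase

# Hypothetical log line following the format quoted in the commit message:
# 'Connection with replica <host>:<port> lost.'
SAMPLE_LINE = "Connection with replica 127.0.0.1:21111 lost."

# Patterns copied from the commit message.
OLD_GLOB = "*Connection with replica client id * lost.*"
NEW_GLOB = "*Connection with replica * lost.*"


def matches(line: str, pattern: str) -> bool:
    """Return True if the whole line matches the glob (case-sensitive)."""
    return fnmatchcase(line, pattern)


if __name__ == "__main__":
    # The old glob requires the literal text "client id", which the line lacks.
    print("old glob matches:", matches(SAMPLE_LINE, OLD_GLOB))
    # The new glob only requires the fixed prefix and the " lost." suffix.
    print("new glob matches:", matches(SAMPLE_LINE, NEW_GLOB))
```

Because the old pattern never matches, a wait on it can only end by timing out, which is consistent with the flaky fallback the commit describes.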

On fast ARM CI runners the RDB transfer can complete before either replica has been killed, so the primary logs '2 replicas still up' instead of 'last replica dropped' or '1 replicas still up'. Add a nested catch fallback to accept all three possible outcomes.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
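The nested fallback described in this commit message can be sketched as follows. This is a minimal Python illustration of the control flow, not the Tcl code in `tests/integration/replication.tcl`. The names `LogPatternNotFound`, `find_in_log`, and `transfer_outcome` are hypothetical, and only the three message fragments are taken from the commit message.

```python
from fnmatch import fnmatchcase


class LogPatternNotFound(Exception):
    """Raised when no log line matches the requested glob."""


def find_in_log(lines, pattern):
    """Return the index of the first line matching the glob, or raise."""
    for index, line in enumerate(lines):
        if fnmatchcase(line, pattern):
            return index
    raise LogPatternNotFound(pattern)


def transfer_outcome(lines):
    """Accept any of the three outcomes named in the commit message."""
    try:
        find_in_log(lines, "*last replica dropped*")
        return "last replica dropped"
    except LogPatternNotFound:
        try:
            find_in_log(lines, "*1 replicas still up*")
            return "1 replicas still up"
        except LogPatternNotFound:
            # Fast runner: the transfer finished while both replicas were up.
            # If this lookup also fails, the exception propagates to the caller.
            find_in_log(lines, "*2 replicas still up*")
            return "2 replicas still up"


if __name__ == "__main__":
    # Hypothetical log excerpt containing only the fast-runner outcome.
    print(transfer_outcome(["... 2 replicas still up ..."]))
```

Trying the expected outcomes first keeps the original assertions meaningful, while the innermost lookup still fails loudly if none of the three messages appears.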

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Nikhil-Manglore

LGTM

sarthakaggarwal97 · 2026-04-21T20:03:15Z

The test failures are unrelated to this change; they are related to #2936.

This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile.

CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

sarthakaggarwal97 mentioned this pull request Apr 15, 2026

[Flaky Tests] Avoid re-triggering io-thread activation #3509

Merged

github-actions Bot assigned sarthakaggarwal97 Apr 15, 2026

sarthakaggarwal97 requested a review from zuiderkwast April 15, 2026 16:34

zuiderkwast reviewed Apr 16, 2026

Comment thread tests/integration/replication.tcl Outdated

Comment thread tests/integration/replication.tcl Outdated

Comment thread tests/integration/replication.tcl Outdated

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 14afd0b to 5a24b23 Compare April 16, 2026 16:45

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 5a24b23 to 17730b5 Compare April 16, 2026 18:29

asagege mentioned this pull request Apr 16, 2026

Test repl fix to see if test-ubuntu-latest-cmake-tls passes (forkless) #3523

Closed

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 56ae8dc to 1e467b0 Compare April 17, 2026 04:04

madolson added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) Apr 17, 2026

zuiderkwast approved these changes Apr 21, 2026

Comment thread tests/integration/replication.tcl

sarthakaggarwal97 and others added 7 commits April 21, 2026 09:54

Stabilize diskless no-drop replication test

9901c78

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Stabilize diskless no-drop replication test

43931e3

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Deflake diskless replication pipe test

37970e3

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

Give diskless no-drop replicas more reconnect time

cc331be

Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

tests: simplify paused replica pid handling

fe1d7e6

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

sarthakaggarwal97 force-pushed the daily-repl-rdb-child-timeout-20260414 branch from 9a86d16 to fe1d7e6 Compare April 21, 2026 16:58

Nikhil-Manglore approved these changes Apr 21, 2026

sarthakaggarwal97 mentioned this pull request Apr 21, 2026

Add zmalloc_aligned() and fix SPMC queue buffer alignment #3504

Merged

zuiderkwast merged commit 03c2d4c into valkey-io:unstable Apr 21, 2026
63 of 66 checks passed

sarthakaggarwal97 mentioned this pull request Apr 23, 2026

Backport Unstable to 9.1 for RC2 #3519

Open
